Project Assignment B - San Francisco 311 Cases

An investigation of the public service requests of San Francisco.

Table of contents:
  1. Motivation
  2. Basic Stats
  3. Data Analysis
  4. Genre
  5. Visualizations
  6. Discussion
  7. Contributions

Part 1: Motivation

This section clarifies the motivation behind the project, what we are going to investigate, and how we are going to proceed.

People around the world increasingly live in densely populated cities, and San Francisco is no exception to this trend. The people and environment surrounding a person play a critical role in their happiness and overall well-being. However, very little information about the behavior of neighbours and the state of the neighbourhood is available when acquiring an apartment or house.

We aim to create an interactive website that provides the residents (and newcomers) of San Francisco with a tool to navigate around the city, investigate the state of each neighbourhood, get to know the people living there better, and find the neighbourhood that best matches a set of preferences. In the end, the goal is to allow residents to:

  1. thoroughly explore the characteristics of each neighbourhood, including...
    • ... differences between the neighborhoods with respect to the amount of disturbances (trash, defects, noise etc.)
    • ... how quickly disturbances are resolved by the responsible agency in a given neighborhood.
    • ... temporal patterns of disturbances (e.g. hourly, weekly, monthly patterns)
  2. learn about their city and neighborhood in selected data stories
  3. obtain a recommendation on the most suitable neighbourhood based on customised preferences

To achieve these objectives we build on three data sets, which are explained in the following.

Datasets

  1. San Francisco 311 Cases

  2. California House Prices

  3. San Francisco Neighborhood Socio-Economic Profiles

Reflection on choice of data

After exploring the data available on DataSF, San Francisco's open data portal, the 311 Cases dataset was selected first.

  1. San Francisco 311 Cases
  2. California House Prices
  3. San Francisco Neighborhood Socio-Economic Profiles

Reflection on user experience

The visuals we create using the aforementioned data sets have one thing in common: user interactivity. The user should be able to explore more deeply those elements that are most interesting to them.

Part 2: Basic Stats

This section explains how the analysed data was extracted and what processing techniques were applied to ensure data quality.

Setup: Import Libraries

Setup: Helping functions

We created a few helper functions to make data processing more convenient.

Setup: Load data

We start out by loading the data and filtering it according to the variables that are most relevant for our analysis.

Part 2.1: Write about your choices in data cleaning and preprocessing

Part 2.1.1: San Francisco 311 dataset

Let's have a first look at the 311 data.

For the analysis defined above, the most important columns of the San Francisco dataset are Category and Neighborhood. Let's start by taking a closer look at those two columns.

When creating a 311 request, it is assigned to a specific Category (e.g. Noise Report or Graffiti). Within these categories, each request is assigned to a subcategory called Request Type. On top of that, the user is also able to specify some Request Details. In order to get an overview of the different requests the data set contains, let's have a look at how many there are.

Alright, already at the top level (Category) we have 108 different categories. In order to make our visualizations most relevant for the user and to not overwhelm them with too many details, we choose to filter the dataset down to a set of focus categories that we consider most relevant for the analysis. We chose categories that are of major concern when deciding on a neighborhood, i.e. issues that are particularly annoying, unpleasant or disturbing.

Let's start by looking at the distribution of these categories.

Looking at the distribution, one can see that it is highly right-skewed, with a long tail of many sparse categories. We choose 15 of the top 25 categories to be our focus categories. In our judgement, these 15 categories best describe disturbing behavior. In the above plot, the Focus Categories are highlighted in red.

The dataset now contains 3,725,194 service requests, i.e. ~15% of the service requests were removed.

However, the focus categories include names such as General Request - PUBLIC WORKS or Sidewalk or Curb, which are not very meaningful to the end-reader. To make it easier for us to visualize and for the end-reader to understand, we choose to group the focus categories into 6 main categories as shown below.

This grouping was created after a thorough investigation of the images attached to the service requests.
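A minimal sketch of how such a grouping can be implemented with a mapping table. The category and main-category names shown are an illustrative excerpt, not the full mapping we derived:

```python
import pandas as pd

# Illustrative (hypothetical) excerpt of the focus-category -> main-category
# mapping; the actual grouping follows our manual investigation of the requests.
MAIN_CATEGORY = {
    "Noise Report": "Noise",
    "Street Defects": "Street Defects",
    "Sidewalk or Curb": "Street Defects",
    "Sewer Issues": "Sewer Issues",
    "General Request - PUBLIC WORKS": "General Requests",
}

def add_main_category(df):
    """Map each focus category to its (less technical) main category."""
    df = df.copy()
    df["MainCategory"] = df["Category"].map(MAIN_CATEGORY)
    return df
```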

Now let's take a closer look at the other important column, Neighborhood.

When creating a 311 request, it is assigned to a location (longitude and latitude). This location is mapped to the official police districts of San Francisco. On top of that, the user is also able to specify a Neighborhood. In order to get an overview of the different neighborhoods the data set contains, let's have a look at how many there are.

Alright, at the top level Police District we have 12 different districts. In order to make our visualizations most relevant for the user, we want to provide a more detailed division of San Francisco. However, the neighborhoods seem to be typed in manually by the user, including entries such as NaN, 8 and various spelling mistakes. Therefore, we choose to use the official Neighborhood Notification Boundaries created by the Department of City Planning (1) as a standard and map the neighborhoods to it.

Let's take a look at the distribution of the 37 different neighborhoods.

At last, we transform and create temporal features used for further analysis.
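The temporal feature engineering can be sketched as follows (the column name Opened is an assumption based on the 311 export):

```python
import pandas as pd

def add_temporal_features(df, ts_col="Opened"):
    """Derive the temporal variables used in Part 2.2.2 from the opening timestamp."""
    df = df.copy()
    ts = pd.to_datetime(df[ts_col])
    df["Year"] = ts.dt.year
    df["Month"] = ts.dt.month
    df["Weekday"] = ts.dt.dayofweek          # 0 = Monday, 6 = Sunday
    df["Hour"] = ts.dt.hour
    df["HourOfWeek"] = df["Weekday"] * 24 + df["Hour"]  # 0..167
    return df
```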

Part 2.1.2: San Francisco House Prices

In order to join the San Francisco House Prices dataset with the San Francisco 311 dataset, we had to come up with a shared key. Since our analysis focuses on the neighborhoods of San Francisco, the neighborhood seemed like an obvious choice. However, the San Francisco House Prices dataset only includes the address and a zipcode, and unfortunately some neighborhoods share the same zipcode. Thus, we utilised the Google Maps API to extract the longitude and latitude of each sold house based on its address. This enables us to decide which neighborhood a house belongs to by matching the geographical location against each neighborhood's polygon in the .geojson file.

Warning: Running the below code takes a considerable amount of time!

Thus, we managed to map all ~10,000 sold houses to 36 unique neighborhoods; the only one not represented is Golden Gate Park. As the name suggests, this is a park populated by few people.
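The point-in-polygon matching can be sketched without external libraries using a simple ray-casting test. This is a simplified stand-in for the actual .geojson lookup; real neighborhood polygons may have holes or multiple parts, which this sketch ignores:

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside the polygon?

    polygon: list of (lon, lat) vertices. Counts how often a horizontal
    ray from the point crosses a polygon edge; an odd count means inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):
            # Longitude where the edge crosses the point's latitude.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def match_neighborhood(lon, lat, neighborhoods):
    """neighborhoods: dict name -> polygon. Returns the containing name, or None."""
    for name, poly in neighborhoods.items():
        if point_in_polygon(lon, lat, poly):
            return name
    return None
```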

Part 2.1.3: San Francisco Neighborhood Socio-Economic Profiles

Since the dataset on San Francisco Neighborhood Socio-Economic Profiles was transformed into a CSV manually, no further pre-processing was needed.

Part 2.2: Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.

The first part of the exploratory analysis will focus on spatial patterns.

Part 2.2.1: Spatial patterns

Part 2.2.1 a: Neighborhoods

First, let's have a look at the location of the different neighborhoods. We will use a Folium choropleth for this purpose.

The plot gave us insights into one important matter:

Part 2.2.1 b: Spatial distribution for selected categories

In the next step we want to visualize how the different categories are distributed across space. We will use Folium and plot the cases on a map, using their latitude/longitude coordinates. Due to heavy computations when using Folium, we will only plot two categories here and restrict the time period to the year 2020. Also we will use two categories that have approximately the same amount of requests in this time period and from which we expect different spatial patterns. We chose Sewer Issues (associated with more calm residential areas) and Litter Receptacles (associated with more crowded areas).

The two plots look more alike than we expected, but we can still identify some differences:

Part 2.2.2 : Temporal patterns

This subsection will aim at getting insights into temporal distributions of the six main categories. We will start at the highest level (the years), continue with monthly and weekly distributions and finally end with how the data is distributed throughout the hours of the day and the week.

Part 2.2.2 a : Yearly distributions

Let's start by having a look at how the data is distributed across the years. The data set contains requests from July 2008, when SF311 started taking web requests, until today. We therefore expect the data volume to grow continuously throughout the years.

From the above plots we can make some interesting observations:

Part 2.2.2 b : Monthly patterns

The monthly distribution reveals some interesting patterns, too:

Part 2.2.2 c : Weekly patterns

Insights:

Part 2.2.2 d : Patterns for the hours of the day

We again gain interesting insights into reporting behaviour:

Part 2.2.2 e : Patterns for the hours of the week

The hours of the week confirm our findings from the previous plot and give us even more detailed insights:

Part 2.2.3 : Resolution time

As a next step, we will look into resolution times, i.e. how quickly 311 cases are solved for the different categories. For this, we will introduce a new variable, TimeToClose.

We drop all requests that have not been solved yet or have a negative time to close (i.e. the case was closed before it was opened).
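A minimal sketch of this preprocessing step, assuming Opened and Closed timestamp columns:

```python
import pandas as pd

def add_time_to_close(df):
    """Add TimeToClose in hours; drop unsolved cases and negative durations."""
    df = df.copy()
    df["TimeToClose"] = (
        pd.to_datetime(df["Closed"]) - pd.to_datetime(df["Opened"])
    ).dt.total_seconds() / 3600.0
    # Unsolved cases have no Closed timestamp -> NaN; negative values are
    # data errors (closed before opened). Both are removed.
    return df[df["TimeToClose"].notna() & (df["TimeToClose"] >= 0)]
```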

As our data is now preprocessed, we can create a boxplot to see how resolving time is distributed for our six main categories.

Examining the bodies of the box plots (disregarding the "outliers" for now), one can observe a significant difference in resolution time. Mobility restrictions like parking enforcements are usually solved within a few hours. Defects, in contrast, have resolution times of several days, up to months. These observations make sense, as parking offenders are usually identified very quickly, whereas defective streets might take the authorities months to fix.

Generally, it must be stated that each of these categories contains very heterogeneous requests. Noise requests, for example, can range from a party lasting a few hours to street works taking several months. At the same time, these requests differ in frequency. Notice that each of the box plots contains a tremendous amount of "outliers" (in fact, the y-axis of this plot is limited; there are even more than the ones you can see), which indicates an extreme discrepancy between resolution times, even within categories. To make this highly uneven distribution of resolution times more visible, we decided to plot them as histograms, using a log scale on the y-axis.

The above histograms confirm the uneven distribution. On the log scale, the bar heights decrease roughly linearly, indicating that the frequencies decrease exponentially. We will need to be aware of this fact in the further analysis.

Part 3: Data Analysis

The data analysis focuses on two questions:

  1. Where is the best neighborhood for partying?
  2. Where do the slum dog millionaires live?

These two independent stories aim to inspire the end-reader and prepare them to generate insights on their own, showing how selected request types relate to the rest of the data and reveal different patterns.

Part 3.1.1: Where is the best neighborhood for partying?

The first story we want to tell is about noise in the different neighbourhoods. Considering that Mission is a vibrant district with a lot of ethnic diversity and nightlife (4), it is not surprising that it is one of the city's noisiest neighbourhoods on the map.

Looking at the hourly patterns, you can see that noise is most prominent in the evening and night hours. This reflects the district's extensive nightlife and its demographics: the neighbourhood is home to many people of Latin origin, who tend to be active later in the day (5).

Another interesting aspect is to look at the bubble chart and the median resolution time of noise requests. Comparing Mission to more affluent residential areas in the north of the city (Presidio, Marina, Pacific Heights), you can see that noise requests are solved more quickly in Mission. One reason for this can be that noise requests there originate mainly from temporally limited events like parties, whereas in residential areas they presumably originate from more long-term sources like street works. The relatively high number of requests during the day seems to confirm this hypothesis.

Part 3.1.2: Where do the slum dog millionaires live?

Another fact we learned from the data set is that sewer issues become a more serious problem the further away one moves from the city center. Select different neighbourhoods on the map and you will see the red bubble becoming bigger and bigger the further out the neighbourhood is located.

We can assume that the high numbers of sewer issues on the outskirts can be explained by the structural differences of the city. Since the areas further away from the city center are usually residential areas (producing a lot of sewage), sewer issues are naturally more common here. On the other hand, streets and other public infrastructure are more stressed in the lively city center, leading to many other defects and reducing the share of sewer issues there.

Part 3.2: If relevant, talk about your machine-learning.

We attempted to apply machine learning to find the hot-spots of disturbance. Let X be a spatial dataset of ~1.5 million latitude/longitude points related to 311 service requests. We utilize DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a scikit-learn algorithm that works well with arbitrary distance metrics; it clusters coordinates by finding core samples of high density and expanding clusters from them.

Our thought was to leverage DBSCAN to study the spatial patterns of disturbance-related service requests in San Francisco. DBSCAN groups these locations (longitude and latitude) of disturbance into clusters. By finding the centroid of each cluster, we get the hot-spots of disturbance. These centroids minimize the average distance to all the points of disturbance in their cluster; hence, they represent the areas where neighbors cause the most frustration and anger.

However, due to limited computational resources, we did not possess enough memory to cluster all service requests of 2020 and 2021.
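A sketch of the intended clustering, with an optional subsampling step to work around the memory limitation. The parameter values are illustrative, not tuned results:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def disturbance_hotspots(coords_deg, eps_m=200, min_samples=50,
                         sample=None, seed=0):
    """Cluster (lat, lon) points with DBSCAN on the haversine metric and
    return one centroid per cluster (the "hot-spots").

    coords_deg: array-like of shape (n, 2), latitude/longitude in degrees.
    sample: optional cap on the number of points, to bound memory usage.
    """
    coords = np.asarray(coords_deg, dtype=float)
    if sample is not None and len(coords) > sample:
        rng = np.random.default_rng(seed)
        coords = coords[rng.choice(len(coords), sample, replace=False)]
    earth_radius_m = 6_371_000
    eps_rad = eps_m / earth_radius_m  # haversine distances are in radians
    labels = DBSCAN(eps=eps_rad, min_samples=min_samples,
                    metric="haversine").fit_predict(np.radians(coords))
    # Label -1 marks noise points; every other label is a cluster.
    return {k: coords[labels == k].mean(axis=0)
            for k in set(labels) if k != -1}
```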

Part 4: Genre. Which genre of data story did you use?

The chosen genre is part Annotated Graphs and part Partitioned Poster, following the structure of a Martini Glass. First, we seek author-driven storytelling marked by a deliberate ordering of scenes, heavy messaging and restricted filtering. Second, we seek reader-driven storytelling marked by no prescribed ordering, light messaging and free interactivity.

The tools used for Visual Narrative (Figure 7 in Segel and Heer):

The tools used from Narrative Structure (Figure 7 in Segel and Heer):

The design elements were chosen to accommodate the end-reader with the Martini Glass structure of narration; thus, one can freely explore the data after being taught how to go about fishing for information and insights.

Part 5: Visualizations.

The chosen visualizations are:

  1. Interactive San Francisco Map:

    • Chosen because... each record in the dataset is associated with a coordinate. These coordinates are essential to investigate spatial patterns related to disturbances in San Francisco. Interactive maps are tailored to visualize spatial data and a great way to compare neighborhoods versus each other.
    • Provides the storytelling... with an interactive filter mechanism and an introductory view to the San Francisco Bay area forming a point of departure. The map is the main plot and the crux of our storytelling. Most importantly, it helps shedding light on the state of each neighbourhood in one view, by showing how disturbance is distributed over the city.
  2. Interactive Bar Charts:

    • Chosen because... each record in the dataset is associated with a timestamp. These timestamps are essential to investigate temporal patterns related to disturbances in San Francisco. A bar chart is a great choice for looking at how a given variable moved over time or to compare variables versus each other.
    • Provides the storytelling... with a complete view to investigate temporal patterns. The end-reader can easily shift between the temporal variables; yearly patterns, monthly patterns, weekly patterns or hourly patterns, answering how disturbance has moved over time.
  3. Interactive Scatter Plot:

    • Chosen because... each record in the dataset is associated with a category describing the type of disturbance. The bubble chart is a great choice to compare entities and display them with three dimensions of data.
    • Provides the storytelling... with an overview of how frequent the different categories (e.g. Street Defects, Sewer Issues) within the selected main category are and how quickly these requests are solved in a certain neighborhood. Gives the end-reader an idea of what kinds of disturbances characterize each neighborhood and how severe each type tends to be.
  4. Interactive San Francisco map with added sliders:

    • Chosen because... the aim of providing the end-reader guidance on where to acquire property in San Francisco needs to be an area marked on a map.
    • Provides the storytelling... with a "best match" neighborhood based on multiple customized preferences, such as the main categories, price etc. The storytelling ends here with the final conclusion visualizing on where the end-reader is recommended to acquire a potential property.

Part 5.1: Figure 1: Interactive San Francisco map

First, we reduce the dataset from ~3.7 million rows to 37 rows, counting the number of service requests for each neighborhood.

Then we provide additional information for each neighborhood in terms of its socio-economic profile. This information will be included in the tooltip of Figure 1, i.e. when hovering over a neighborhood one will see its socio-economic profile.

We plot the choropleth mapbox utilizing the official SF Planning Neighborhoods geojson file (1) to provide the plot with the borders of each neighborhood. We color each area according to the number of service requests.

Part 5.2: Figure 2: Interactive bar charts

As in Part 2.2.2, we reduce the dataset by grouping all service requests according to a given temporal variable, while counting the total number of requests for each time unit. This is done for the yearly, monthly, daily and hourly pattern.

Utilizing plotly.graph_objects we create a bar chart for each temporal variable, overlay them on top of each other, and make a navigation button for each of them to switch between what plot to display.

Part 5.3: Figure 3: Interactive Scatter plot

With Figure 3 we aim to empower our users to compare the characteristics of different neighborhoods with respect to request types, their frequency and respective resolution times.

Due to the highly skewed distributions for the resolution time, we will aggregate the resolution time using the median for each category.

Furthermore, we will create a column that counts the requests, so that we can calculate its share of the total requests.
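The aggregation for the bubble chart can be sketched as follows (column names are assumptions matching the preprocessing above):

```python
import pandas as pd

def bubble_chart_data(df):
    """Per (Neighborhood, Category): request count, share of all requests and
    median resolution time (the median is robust to the skewed distribution)."""
    out = (df.groupby(["Neighborhood", "Category"])
             .agg(Count=("TimeToClose", "size"),
                  MedianTimeToClose=("TimeToClose", "median"))
             .reset_index())
    out["Share"] = out["Count"] / out["Count"].sum()
    return out
```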

We will design the bubble chart as follows:

Part 5.4: Figure 4: Interactive San Francisco map with added sliders

With our last visualization we aim to provide our users with unique insights into which neighborhood suits their personal needs best. To achieve this, we will create a highly interactive and personalized map, where the user can state how important he/she considers different aspects of a neighborhood. On a scale from 1 (not important) to 5 (very important) the user will be able to rate his/her perceived importance of

For this purpose, we developed a simple and easy to understand algorithm that calculates an individualized convenience score for each neighborhood based on the user's preferences. We will plot this convenience score on a map and thereby help our users to identify the neighborhood that suits their needs best.

Let's start with the coding part. First, we will store the user's preferences in a dictionary.

Now, let's get the data that we need.

We will group the data by neighborhood and aggregate it in two ways:

  1. counts (to calculate incidents per inhabitant)
  2. median resolution time

We will keep the median resolution time in a separate data frame for now, as we want to weigh it according to the weights the user specifies for the different main categories. Let's start working on the counts first. We include the population data and divide the request counts for the different neighborhoods by their population. We will obtain requests per inhabitant and thereby make the counts for the different neighborhoods more comparable.
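The normalization by population can be sketched as (names are illustrative):

```python
import pandas as pd

def requests_per_inhabitant(counts, population):
    """counts: DataFrame indexed by neighborhood, per-category request counts.
    population: Series of inhabitants per neighborhood (from the
    socio-economic profiles). Returns counts normalized per inhabitant."""
    # Row-wise division, aligned on the neighborhood index.
    return counts.div(population, axis=0)
```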

Next, we add the house price data to the data frame. Assuming a skewed distribution of housing prices, we will use the median price level for each neighborhood.

The data is missing housing prices for Treasure Island/YBI and Golden Gate Park.

Alright, now that we have all features ready, we will normalize them in order to obtain a score that lies between 0 and 1. For this purpose we use a min/max-scaler that is built into sklearn. We will normalize the category counts, the price level and the resolution times.
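A sketch of the scaling step using sklearn's MinMaxScaler, which maps each column to [0, 1]:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def normalize_features(df):
    """Min/max-scale every column to [0, 1] so that category counts,
    the price level and the resolution times become comparable."""
    scaler = MinMaxScaler()
    return pd.DataFrame(scaler.fit_transform(df),
                        index=df.index, columns=df.columns)
```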

Perfect. Now we can add the normalized median resolution time to the rest of our features. Since we want to weigh the normalized resolution time for the different main categories with the weights that the user assigns to these categories, we will calculate a weighted resolution time score for each neighborhood as follows: $$weighted\_resolution\_time\_score_{n} = \frac{1}{\sum_{c \in C} w_c}\sum_{c \in C}t^{res}_{c,n} \cdot w_c$$ where
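The weighted-average formula above can be implemented directly in pandas (a sketch; the variable names are illustrative):

```python
import pandas as pd

def weighted_resolution_time_score(t_res, weights):
    """t_res: DataFrame (neighborhoods x main categories) of normalized median
    resolution times t^res_{c,n}; weights: dict category -> user weight w_c.
    Returns sum_c t_res[c] * w_c / sum_c w_c, per neighborhood."""
    w = pd.Series(weights)
    # Multiplication aligns on category columns; sum over categories.
    return (t_res[w.index] * w).sum(axis=1) / w.sum()
```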

Looking good! Let's have a look at the stats to see if everything worked as expected.

We can see that

Alright, we're almost there. Final step: calculating the overall convenience score for each neighborhood $n$ as a weighted average of the different features. Since our current feature scores represent inconvenience (the higher the number of requests per inhabitant, the price level or the resolution time, the more inconvenient for the user), we will subtract the feature scores from 1 in order to obtain the convenience score. $$convenience\_score_n = 1 - \left(\frac{1}{\sum_{m \in M} w_m} \sum_{m \in M} x_{n,m} \cdot w_m\right)$$ where
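The convenience score can be computed analogously (a sketch with illustrative names):

```python
import pandas as pd

def convenience_score(features, weights):
    """features: DataFrame (neighborhoods x features), each column x_{n,m}
    in [0, 1] and representing *inconvenience*; weights: dict feature -> w_m.
    Returns 1 minus the weighted average of the inconvenience scores."""
    w = pd.Series(weights)
    return 1 - (features[w.index] * w).sum(axis=1) / w.sum()
```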

What a great plot with super valuable insights for our users!

Part 6: Discussion. Think critically about your creation

  1. What went well?

    • We conducted a thorough exploratory data analysis.
    • Additional datasets (socio-economic data, house prices) enriched our work significantly.
    • The tool to identify the most suitable neighborhood is valuable and insightful, engaging the user in an interactive and personalised way.
    • Good structure of the notebook and our steps and methods are well explained.
    • Clear layout of the website.
    • We handled website challenges (e.g. reducing the size of the dataset to prevent exploding memory usage).
    • We covered a lot of the content of the class.
    • Plotly's online community is not as big as those of other libraries, and it has been a pain to configure some small details of the charts. However, we managed to cope.
  2. What is still missing? What could be improved? Why?

    • Successful implementation of a machine learning model identifying where the hot spots of disturbance are located in San Francisco.
      • This would provide more detailed information for our choropleth map.
      • It could be, for example, that disturbances are concentrated in the southern end of a neighborhood.
    • Optimization. The running time of the website could be improved further.
      • Reducing the datasets to more compact data types would decrease memory usage.
      • Optimizing the source code would decrease runtime.
      • In the end, it will provide a better user experience.

Part 7: Contributions. Who did what?

The following list shows the person mainly responsible for each section of the project.

Part 1: Andreas
Part 2.1: Andreas
Part 2.2: Lukas
Part 3.1: Lukas
Part 3.2: Andreas
Part 4: Andreas
Part 5.1: Andreas
Part 5.2: Andreas
Part 5.3: Lukas
Part 5.4: Lukas
Part 6: All
Website: Fadi

All group members supported the person mainly responsible for each section.

References

1: Planning Neighborhood Groups Map, May 2021, https://data.sfgov.org/Geographic-Locations-and-Boundaries/Planning-Neighborhood-Groups-Map/iacs-ws63

2: Weather and Climate. Accessed: April 29, 2021. https://weather-and-climate.com/average-monthly-precipitation-Rainfall-inches,San-Francisco,United-States-of-America

3: Changes to the Dataset: Case Data from San Francisco 311 (SF311), June 2019. Download: https://data.sfgov.org/api/views/vw6y-z8j6/files/11faf643-7bd1-49fd-bdc2-f0c92212587b?download=true&filename=Open%20Data%20Changes%20June%202019.pdf

4: San Francisco Neighborhoods. Wikipedia. Accessed: May 12, 2021. https://en.wikipedia.org/wiki/San_Francisco#Neighborhoods

5: San Francisco Neighborhoods - Socio-Economic Profiles (American Community Survey 2012–2016). San Francisco Planning Department, 2018. https://default.sfplanning.org/publications_reports/SF_NGBD_SocioEconomic_Profiles/2012-2016_ACS_Profile_Neighborhoods_Final.pdf